
sketch out sync codecs + threadpool #3715

Open: d-v-b wants to merge 29 commits into zarr-developers:main from d-v-b:perf/faster-codecs

Conversation

@d-v-b (Contributor) commented Feb 18, 2026

This is a work in progress, with all the heavy lifting done by Claude. The goal is to improve the performance of our codecs by avoiding overhead in to_thread and other async machinery. At the moment we have deadlocks in some of the array tests, but I am opening this now as a draft to see if the benchmarks show anything promising.
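For context on the to_thread overhead claim, here is a minimal standalone micro-benchmark (not code from this PR) comparing a direct gzip call against the same call routed through asyncio.to_thread:

```python
import asyncio
import gzip
import time

payload = gzip.compress(b"x" * 1_000_000)

def decode_sync() -> bytes:
    return gzip.decompress(payload)

async def decode_via_thread() -> bytes:
    return await asyncio.to_thread(gzip.decompress, payload)

# Direct call: no event-loop hop, no executor handoff.
t0 = time.perf_counter()
for _ in range(100):
    decode_sync()
print("sync:", time.perf_counter() - t0)

# Each to_thread call pays for a future, an executor submit, and a
# loop callback before any decompression happens.
async def main() -> None:
    t0 = time.perf_counter()
    for _ in range(100):
        await decode_via_thread()
    print("to_thread:", time.perf_counter() - t0)

asyncio.run(main())
```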

@github-actions bot added the "needs release notes" label Feb 18, 2026
@d-v-b added the "benchmark" label and removed the "needs release notes" label Feb 18, 2026
@github-actions bot added the "needs release notes" label Feb 18, 2026
@codspeed-hq bot commented Feb 18, 2026

Merging this PR will improve performance by ×5

⚡ 50 improved benchmarks
✅ 6 untouched benchmarks
⏩ 6 skipped benchmarks [1]

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| WallTime | test_write_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=None)-gzip] | 1,031.6 ms | 270.8 ms | ×3.8 |
| WallTime | test_write_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=None)-None] | 554.3 ms | 181.7 ms | ×3.1 |
| WallTime | test_write_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=(1000,))-None] | 1,551.5 ms | 684.4 ms | ×2.3 |
| WallTime | test_write_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=(1000,))-gzip] | 2,111.7 ms | 791.6 ms | ×2.7 |
| WallTime | test_write_array[memory-Layout(shape=(1000000,), chunks=(100,), shards=(1000000,))-None] | 5.5 s | 1.8 s | ×3.1 |
| WallTime | test_write_array[memory-Layout(shape=(1000000,), chunks=(100,), shards=(1000000,))-gzip] | 9.7 s | 2.6 s | ×3.7 |
| WallTime | test_write_array[local-Layout(shape=(1000000,), chunks=(1000,), shards=None)-None] | 1,204.9 ms | 552.4 ms | ×2.2 |
| WallTime | test_write_array[local-Layout(shape=(1000000,), chunks=(100,), shards=(1000000,))-None] | 5.5 s | 1.8 s | ×3.1 |
| WallTime | test_write_array[local-Layout(shape=(1000000,), chunks=(100,), shards=(1000000,))-gzip] | 9.7 s | 2.6 s | ×3.7 |
| WallTime | test_write_array[local-Layout(shape=(1000000,), chunks=(1000,), shards=(1000,))-None] | 2.7 s | 1.3 s | ×2 |
| WallTime | test_slice_indexing[(50, 50, 50)-(slice(None, 10, None), slice(None, 10, None), slice(None, 10, None))-memory] | 1,831.3 µs | 662.2 µs | ×2.8 |
| WallTime | test_read_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=None)-None] | 278.1 ms | 66.7 ms | ×4.2 |
| WallTime | test_read_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=(1000,))-gzip] | 1,315 ms | 532.1 ms | ×2.5 |
| WallTime | test_write_array[local-Layout(shape=(1000000,), chunks=(1000,), shards=None)-gzip] | 1,631.2 ms | 639.4 ms | ×2.6 |
| WallTime | test_read_array[memory-Layout(shape=(1000000,), chunks=(100,), shards=(1000000,))-gzip] | 6 s | 1.4 s | ×4.2 |
| WallTime | test_read_array[local-Layout(shape=(1000000,), chunks=(1000,), shards=None)-None] | 619.7 ms | 143.9 ms | ×4.3 |
| WallTime | test_read_array[memory-Layout(shape=(1000000,), chunks=(100,), shards=(1000000,))-None] | 2,886.8 ms | 604.5 ms | ×4.8 |
| WallTime | test_slice_indexing[(50, 50, 50)-(slice(None, None, None), slice(None, None, None), slice(None, None, None))-memory] | 419.2 ms | 99.4 ms | ×4.2 |
| WallTime | test_read_array[local-Layout(shape=(1000000,), chunks=(1000,), shards=None)-gzip] | 952.5 ms | 228.6 ms | ×4.2 |
| WallTime | test_write_array[local-Layout(shape=(1000000,), chunks=(1000,), shards=(1000,))-gzip] | 3.2 s | 1.5 s | ×2.2 |
| … | … | … | … | … |

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.


Comparing d-v-b:perf/faster-codecs (9d77ca5) with main (f8b3d38)


Footnotes

  [1]: 6 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them in the CodSpeed app to remove them from the performance reports.

@@ -0,0 +1,228 @@
# Design: Fully Synchronous Read/Write Bypass
@d-v-b (Contributor Author):

@rabernat @dcherian have a look; this is Claude's summary of the perf blockers addressed in this PR.

@d-v-b (Contributor Author) commented Feb 19, 2026

The performance impact ranges from "good" to "amazing", so I think we want to learn from this PR. IMO this is NOT a merge candidate; rather, it should function as a proof of concept for what we can get if we rethink our current codec API.

Some key points:

  • Wrapping CPU-bound routines like gzip encode / decode with async adds needless latency. We get a lot of perf by using a sync fast path whenever possible. We need to bake this "sync is faster when available" lesson into both our codec API and store API. For example, there is no reason that reading from or writing to an in-memory dict should be async. (A rough sketch of the dispatch idea follows this list.)
  • We should design the chunk encoding process so that IO-bound and CPU-bound routines are logically separated in the codebase. That means modelling sharding as a codec is probably wrong. Sharding is declared as a codec in array metadata, but we don't need to model it as a codec internally. Sharding changes how we do IO, but it should not change when we do IO.
  • I haven't looked at memory use at all; that's probably a separate effort.
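To make the first point concrete, here is a rough sketch of what a sync fast path in the pipeline could look like; the names `supports_sync`, `get_sync`, and `_decode_sync` are hypothetical illustrations, not the PR's actual API:

```python
from typing import Any

def decode_chunk(codecs: list[Any], store: Any, key: str) -> Any:
    # Fast path: if every codec and the store expose a synchronous
    # implementation, skip the event loop and executors entirely.
    if getattr(store, "supports_sync", False) and all(
        hasattr(c, "_decode_sync") for c in codecs
    ):
        data = store.get_sync(key)
        # Decode runs the codec chain in reverse order of encoding.
        for codec in reversed(codecs):
            data = codec._decode_sync(data)
        return data
    # Slow path: fall back to the existing async pipeline.
    raise NotImplementedError("fall back to the async pipeline here")
```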

@d-v-b (Contributor Author) commented Feb 19, 2026

The current performance improvements are achieved without any parallelism; I'm adding that now.

@d-v-b (Contributor Author) commented Feb 19, 2026

The latest commit adds thread-based parallelism to the synchronous codec pipeline. We compute an estimated compute cost from the chunk size, codecs, and operation (encode / decode), and use that estimate to choose a parallelism strategy, ranging from no threads to full use of a thread pool.
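Roughly, the selection could look like the following sketch; the cost table and thresholds here are invented for illustration (the PR's real table appears later in this thread):

```python
import os

# Invented cost table (ns/byte), for illustration only.
_DECODE_NS_PER_BYTE = {"BytesCodec": 0.0, "ZstdCodec": 1.0, "GzipCodec": 8.0}


def choose_workers(n_chunks: int, chunk_nbytes: int, codec_names: list[str]) -> int:
    """Pick a worker count from an estimate of total decode cost."""
    ns_per_byte = sum(_DECODE_NS_PER_BYTE.get(name, 1.0) for name in codec_names)
    total_ns = n_chunks * chunk_nbytes * ns_per_byte
    if total_ns < 1_000_000:
        # Under ~1 ms of estimated work: thread handoff costs more than it saves.
        return 1
    # Scale up to the CPU count, but never beyond one worker per chunk.
    return min(n_chunks, os.cpu_count() or 4)
```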

@d-v-b marked this pull request as ready for review Feb 20, 2026 15:19
@d-v-b (Contributor Author) commented Feb 20, 2026

Marking this as not a draft, because I think we should actually consider merging it.

@github-actions bot removed the "needs release notes" label Feb 20, 2026
@d-v-b (Contributor Author) commented Feb 20, 2026

I added a changelog entry and made a breaking change: removal of the batch_size parameter from the BatchedCodecPipeline. The batch size was already limited by the concurrency limit, and the parallelism model offered by batch_size (applying codec X over batch_size chunks at once) doesn't really make performance sense versus parallelism across chunks.

@dcherian (Contributor) commented:
This is extremely hard to review at the moment. Can we look at a new PR with just one affected codec (Zstd?) please?

@d-v-b (Contributor Author) commented Feb 20, 2026

The changes here aren't really made at the granularity of a single codec. We have new codec pipeline behavior, which requires new methods on both stores and codecs. When the codec pipeline identifies that all the codecs and the store support the fast path, it uses the fast path. So breaking that apart is difficult.

```python
# Before: wrap the numcodecs GZip call so it can run on a thread
return await asyncio.to_thread(
    as_numpy_array_wrapper, GZip(self.level).decode, chunk_bytes, chunk_spec.prototype
)

# After: the synchronous implementation is factored out into _decode_sync,
# which the fast path can also call directly
return await asyncio.to_thread(self._decode_sync, chunk_bytes, chunk_spec)
```
Contributor:

As an aside, these to_thread calls are extremely annoying; they run on an independent thread pool, not the one Zarr sets up (and are thus unconstrained by any config setting).

Instead we need something like this:
https://github.com/earth-mover/xpublish-tiles/blob/1a800e05617d609098bbcd1a1f5ac9bbdcb531aa/src/xpublish_tiles/lib.py#L147-L152
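For reference, the linked helper follows a common pattern: submit blocking work to an executor the application owns rather than to_thread's default one. A rough sketch of that pattern (not the exact xpublish-tiles code):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from functools import partial
from typing import Callable, TypeVar

T = TypeVar("T")

# An executor the application owns, so its size can follow Zarr's config
# rather than asyncio's default-executor heuristics.
_EXECUTOR = ThreadPoolExecutor(max_workers=8)


async def async_run(func: Callable[..., T], *args: object) -> T:
    # Unlike asyncio.to_thread, this submits to our own pool.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_EXECUTOR, partial(func, *args))
```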

@d-v-b (Contributor Author):

Yes, to_thread has serious problems: python/cpython#136084. I will drop in your async_run idea!

Comment on lines 89 to 115
```python
_CODEC_DECODE_NS_PER_BYTE: dict[str, float] = {
    # Near-zero cost — just reshaping/copying/checksumming
    "BytesCodec": 0,
    "Crc32cCodec": 0,
    "TransposeCodec": 0,
    "VLenUTF8Codec": 0,
    "VLenBytesCodec": 0,
    # Medium cost — fast C codecs, GIL released
    "ZstdCodec": 1,
    "BloscCodec": 0.5,
    # High cost — slower C codecs, GIL released
    "GzipCodec": 8,
}

_CODEC_ENCODE_NS_PER_BYTE: dict[str, float] = {
    # Near-zero cost — just reshaping/copying/checksumming
    "BytesCodec": 0,
    "Crc32cCodec": 0,
    "TransposeCodec": 0,
    "VLenUTF8Codec": 0,
    "VLenBytesCodec": 0,
    # Medium cost — fast C codecs, GIL released
    "ZstdCodec": 3,
    "BloscCodec": 2,
    # High cost — slower C codecs, GIL released
    "GzipCodec": 50,
}
```
@d-v-b (Contributor Author):

@dcherian here's the estimated cost of running each codec in the encode and decode path
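For scale, taking the decode table at face value: decoding 1,000 gzip chunks of 100 KB each would be estimated at 1,000 × 100,000 B × 8 ns/B = 0.8 s of compute, which comfortably justifies spinning up the pool.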

@mkitti (Contributor) left a comment:

Could we adjust work estimates based on codec parameters?

"VLenUTF8Codec": 0,
"VLenBytesCodec": 0,
# Medium cost — fast C codecs, GIL released
"ZstdCodec": 3,
Contributor:

Can we adjust by the compression level? Compression level -1000 is different from compression level 22 in terms of time.

@d-v-b (Contributor Author):

Yes, we could put this in the model; we would have to take some data first, of course. Something like the sketch below, say.
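One hypothetical shape for a level-aware estimate; the scaling factors are made up pending real measurements:

```python
def zstd_encode_ns_per_byte(level: int) -> float:
    """Made-up cost curve: fast negative levels near 0.5 ns/byte,
    the default level 3 around 3 ns/byte, growing steeply toward level 22."""
    if level <= 0:
        return 0.5
    return 3.0 * (1.25 ** (level - 3))
```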

```python
_MIN_CHUNK_NBYTES_FOR_POOL = 100_000  # 100 KB


def _choose_workers(n_chunks: int, chunk_nbytes: int, codecs: Iterable[Codec]) -> int:
```
Contributor:

Can this be def _use_thread_pool(...)->bool instead?


```python
# Before:
def _get_pool(max_workers: int) -> ThreadPoolExecutor:
    """Get a thread pool with at most *max_workers* threads."""

# After: the worker count now comes from config inside the function
def _get_pool() -> ThreadPoolExecutor:
```
Contributor:

Hard to see why this had to change, but... I'm not opposed to it.

"""Get the module-level thread pool, creating it lazily."""
global _pool
if _pool is None:
max_workers: int = config.get("threading.codec_workers").get("max") or os.cpu_count() or 4
@dcherian (Contributor) commented Feb 20, 2026:

This is duplicated in _choose_workers; doesn't donfig have a way to do runtime defaults?

@d-v-b (Contributor Author):

A user can still unset the param in the config even if we have a default, so we need to handle unsetness. I consolidated this logic into a single function so at least the duplication is gone (roughly the shape sketched below).
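For illustration, a minimal sketch of what such a consolidated helper could look like; the function name `_default_codec_workers` is hypothetical, and the import path for Zarr's config object is an assumption:

```python
import os

from zarr.core.config import config  # assumed import path for Zarr's donfig config


def _default_codec_workers() -> int:
    # The key can exist but be unset (None), so handle unsetness explicitly:
    # configured value, then CPU count, then a fixed floor.
    configured = config.get("threading.codec_workers").get("max")
    return configured or os.cpu_count() or 4
```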
